GH-32609: [Python] Add type annotations to PyArrow #47609

rok wants to merge 1 commit into apache:main
Conversation
dangotbanned
left a comment
Hey @rok, I come bearing unsolicited suggestions 😉
A lot of this is from 2 recent PRs that have had me battling the current stubs more
python/pyarrow-stubs/compute.pyi (outdated):

def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
def scalar(value: bool | float | str) -> Expression: ...
Based on arrow/python/pyarrow/_compute.pyx (lines 2859 to 2869 in 13c2615):
The Expression version (pc.scalar) should accept the same types as pa.scalar, right?
I ran into it the other day here, where I needed to add a cast.
I'm not sure what you are suggesting. Do you mean:
diff --git i/python/pyarrow-stubs/compute.pyi w/python/pyarrow-stubs/compute.pyi
index df660e0c0c..f005c5f552 100644
--- i/python/pyarrow-stubs/compute.pyi
+++ w/python/pyarrow-stubs/compute.pyi
@@ -84,7 +84,7 @@ _R = TypeVar("_R")
def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
-def scalar(value: bool | float | str) -> Expression: ...
+def scalar(value: Any) -> Expression: ...
Hmm, yeah I guess Any is what you have there so that could work.
But I think it would be more helpful to use something like this to start:
https://github.com/rok/arrow/blob/6a310149ed305d7e2606066f5d0915e9c23310f4/python/pyarrow-stubs/_stubs_typing.pyi#L50
PyScalar: TypeAlias = (bool | int | float | Decimal | str | bytes |
    dt.date | dt.datetime | dt.time | dt.timedelta)

Then the snippet from (#47609 (comment)) seems to imply pa.Scalar is valid as well.
So maybe this would document it more clearly?
def scalar(value: PyScalar | lib.Scalar[Any] | None) -> Expression: ...

def name(self) -> str: ...
@property
def num_kernels(self) -> int: ...
I wonder if the overloads can be generated instead of written out and maintained manually.
Took me a while to discover this without it being in the stubs 😅
@property
def kernels(self) -> list[ScalarKernel | VectorKernel | ScalarAggregateKernel | HashAggregateKernel]:
I know this isn't accurate for Function itself, but it's the type returned by FunctionRegistry.get_function
If you wanted to be a bit fancier, maybe add some Generics into the mix?
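A minimal sketch of what that could look like; the class shapes here are hypothetical stand-ins, not the actual pyarrow API.

```python
# Hypothetical sketch: parameterize Function by its kernel type so that,
# e.g., FunctionRegistry.get_function overloads could narrow .kernels.
from typing import Generic, TypeVar

class ScalarKernel: ...
class VectorKernel: ...

K = TypeVar("K")

class Function(Generic[K]):
    def __init__(self, kernels: list[K]) -> None:
        self._kernels = kernels

    @property
    def kernels(self) -> list[K]:
        # For a Function[VectorKernel], checkers see list[VectorKernel].
        return self._kernels

fn: Function[VectorKernel] = Function([VectorKernel()])
```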
"look at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time)"

That would probably be more useful than the route I was going for here.
In Python there's only the repr to work with, but there is quite a lot of information encoded in it:
>>> import pyarrow.compute as pc
>>> pc.get_function("array_take").kernels[:10]
[VectorKernel<(primitive, integer) -> computed>,
 VectorKernel<(binary-like, integer) -> computed>,
 VectorKernel<(large-binary-like, integer) -> computed>,
 VectorKernel<(fixed-size-binary-like, integer) -> computed>,
 VectorKernel<(null, integer) -> computed>,
 VectorKernel<(Type::DICTIONARY, integer) -> computed>,
 VectorKernel<(Type::EXTENSION, integer) -> computed>,
 VectorKernel<(Type::LIST, integer) -> computed>,
 VectorKernel<(Type::LARGE_LIST, integer) -> computed>,
 VectorKernel<(Type::LIST_VIEW, integer) -> computed>]
>>> pc.get_function("min_element_wise").kernels[:10]
[ScalarKernel<varargs[uint8*] -> uint8>,
 ScalarKernel<varargs[uint16*] -> uint16>,
 ScalarKernel<varargs[uint32*] -> uint32>,
 ScalarKernel<varargs[uint64*] -> uint64>,
 ScalarKernel<varargs[int8*] -> int8>,
 ScalarKernel<varargs[int16*] -> int16>,
 ScalarKernel<varargs[int32*] -> int32>,
 ScalarKernel<varargs[int64*] -> int64>,
 ScalarKernel<varargs[float*] -> float>,
 ScalarKernel<varargs[double*] -> double>]
>>> pc.get_function("approximate_median").kernels
[ScalarAggregateKernel<(any) -> double>]
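Since the repr is all that Python exposes, one purely hypothetical starting point would be to parse the signature back out of it. The helper name below is made up, and varargs kernels would need extra handling.

```python
import re

def parse_kernel_repr(text: str):
    """Split a kernel repr like 'VectorKernel<(primitive, integer) -> computed>'
    into (kind, input types, output type).

    Rough sketch: varargs kernels and other repr shapes need extra cases.
    """
    m = re.match(r"(\w+Kernel)<\((.*)\) -> (.+)>", text)
    if m is None:
        return None
    kind, inputs, out = m.groups()
    return kind, [t.strip() for t in inputs.split(",")], out

# Example against one of the reprs shown above:
parsed = parse_kernel_repr("VectorKernel<(primitive, integer) -> computed>")
```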
Oh awesome! Thank you @dangotbanned, I love unsolicited suggestions like these! I am at PyData Paris right now so I probably can't reply properly until Monday, but given your experience I'm sure these will be very useful!

Just a mental note: @pitrou suggested looking at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).
@dangotbanned I got pyright, mypy and ty passing in CI with the following settings (lines 97 to 119 in c9608d2). I'll try to do another pass sometime next week and ask for a review on the discussions thread, but meanwhile feel free to do a pass if you can spare the time :D
raulcd
left a comment
@rok first, thanks for the huge amount of work here!! You rock! (see what I did there?)
I've just taken a look at the build/CI part of it.
As suggested yesterday, I think we should add documentation to the Python development guide, both on how to run type checking and on what is expected from pyarrow developers.
What is the expected workflow when working on it?
.github/workflows/python.yml (outdated):

- name: Type check with mypy and pyright
  run: |-
    python -m pip install mypy pyright ty griffe libcst pytest hypothesis fsspec scipy-stubs pandas-stubs types-python-dateutil types-psutil types-requests griffe libcst sphinx types-cffi
    pip install -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple pyarrow
This is currently installing pyarrow from the nightlies; as part of CI we should test the pyarrow version from the PR / commit / branch.
On this job pyarrow is built via archery in a Docker container, so it won't be available outside the container. We might want to add a new job just for running type checking, where the Docker container also runs the type checks, or add it to this job's archery docker run. We can probably drive the check with an environment variable and run it directly in ci/scripts/python_test.sh.
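One possible shape for the environment-variable gate suggested above, as a sketch only: the variable name PYARROW_TYPE_CHECKS and the echoed messages are made up, not what was merged.

```shell
# Hypothetical sketch: gate type checking in ci/scripts/python_test.sh
# behind an environment variable so one script serves both jobs.
run_type_checks() {
  if [ "${PYARROW_TYPE_CHECKS:-OFF}" = "ON" ]; then
    echo "running type checks"
    # mypy && pyright && ty check   # real invocations would go here
  else
    echo "type checks skipped"
  fi
}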
Indeed. I've added /arrow/ci/scripts/python_test_type_annotations.sh (and Windows equivalents) to call from compose.yaml:
Lines 935 to 945 in 00047e4
and python.yml:
arrow/.github/workflows/python.yml
Lines 242 to 246 in 00047e4
arrow/.github/workflows/python.yml
Lines 304 to 307 in 00047e4
.github/workflows/python.yml (outdated):

    pyright
    ty check
    cd ..
    python ./dev/update_stub_docstrings.py -f ./python/pyarrow-stubs
Is the update_stub_docstrings.py something:
- devs should run before pushing or via commit hook
- CI should run on every PR
- something we should run when building sdist/wheels
- part of the release process
It's not clear to me.
update_stub_docstrings.py should run just before wheel-build time. See 03bdf10 for how I imagine that would look for wheels. Please note:
- pyarrow must be built before packaging the wheel, so that docstrings of dynamic symbols become available and can be inserted into the stub files
- this also adds a test that checks that the packaged wheel includes one of the stub files (compute.pyi) and that at least one of its symbols is annotated with a docstring

I've also added docs to development.rst, but they may be too sparse. Please check.
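To make the docstring-insertion idea concrete, here is a toy, string-based version. The real update_stub_docstrings.py works on syntax trees (griffe/libcst); the helper name and regex here are made up for illustration and assume a single-line stub signature.

```python
import re

def attach_docstring(stub: str, name: str, doc: str) -> str:
    """Replace 'def name(...) -> T: ...' with a body holding the docstring.

    Toy sketch only; it assumes a single-line signature ending in '...'.
    """
    pattern = rf"(def {re.escape(name)}\(.*?\).*?:) \.\.\."
    replacement = f'\\1\n    """{doc}"""\n    ...'
    return re.sub(pattern, replacement, stub)

stub_line = "def scalar(value: Any) -> Expression: ..."
result = attach_docstring(stub_line, "scalar", "Create a scalar expression.")
```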
Thanks for the review @raulcd!
Do you think we should put the build/CI part into a separate PR?
As mentioned above, I've added some docs but they may be too sparse.
Sorry for the delay in getting back to you @rok! I've been putting in a lot of time working with these stubs. Some fairly high-level things that might be worth checking:
I'll try to dive into some more specific cases soon in a review - so this is just homework for you for now, if you wanted it 😉
I went through the list. ~4 were missing.
You might be right; this remains my homework to check.
Removed.

I've split out the CI part into a separate PR and will proceed to split it down further to enable review.
rebase and some minor work
dan's homework
minor post rebase change
lint
annotation fix
fix some annotations
fix path on macos
add type checking guidelines for developers
package stubs into wheels and test for presence
Add typechecking for Windows
Add typechecking for macos
Moving typechecks under 'Execute Docker Build' step
test for pyarrow
Review feedback
some fixes
remove some newlines
fixes
more fix
fix
cleanup
test
minor fix
fix CI
fix mypy
minor fixes
fix ty checks
WIP
WIP pyright for test_{pandas,scalars,schema,substrait}.py
pyright for test_{sparse_tensor,substrait,tensor,types,udf,without_numpy}.py and util.py
pyright for test_compute.py
pyright for test_dataset.py
pyright test_types.py
pyright work
yet further pyright work
further pyright work
Make pyright stricter
workaround for shadowed types module
bumpy python in pyright
Update python/pyproject.toml
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
misc
Update python/pyarrow-stubs/pyarrow/compute.pyi
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
ty
reduce mypy errors
experiment
fix pyright config
fixing missing-imports
Some changes
Update python/pyarrow-stubs/_compute.pyi
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
Apply suggestions from code review
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
minor
adding some ignores to pass more checks
Add CI check
Add utility for adding docstrings into annotations
Minor changes to pyarrow so some typechecks pass
Add pyarrow-stubs minus their docstings
Rebased this on main to make it easier to split further into smaller PRs.
NOTE: This PR is currently being split into multiple smaller ones. DO NOT MERGE.

This proposes adding type annotations to pyarrow by adopting pyarrow-stubs into pyarrow. To do so, we copy pyarrow-stubs's stub files into arrow/python/pyarrow-stubs/, restructure them somewhat, and add more annotations. We remove docstrings from the annotations and provide a script to insert docstrings into the stub files at wheel-build time. We also remove overloads from the annotations to simplify this PR. We then add annotation checks for all project files and introduce a CI check to make sure all mypy, pyright and ty annotation checks pass (see python/pyproject.toml for any exceptions).

The PR introduces:
arrow/python/pyarrow-stubs/
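The wheel check mentioned in the discussion above (that the packaged wheel contains a stub file with docstrings) could look roughly like this. The helper name is hypothetical; the actual test lives in the PR's test suite.

```python
# Sketch: a built wheel is a zip archive, so we can look inside it for a
# stub file and check that at least one docstring made it in.
import zipfile

def wheel_has_annotated_stub(wheel_path: str) -> bool:
    with zipfile.ZipFile(wheel_path) as wheel:
        stub = next(
            (n for n in wheel.namelist() if n.endswith("compute.pyi")), None
        )
        if stub is None:
            return False
        # Docstrings were inserted at build time if a triple quote appears.
        return b'"""' in wheel.read(stub)
```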